
image.png

Explanation of all 5 boxes:

  1. Azure Data Lake - Azure Data Lake is a scalable data storage service. ADL includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.

  2. Azure Databricks - Azure Databricks is offered as a PaaS by Microsoft. It is built on top of the open-source Apache Spark and supports languages such as Python, Scala, SQL and R. ADB lets us spin up clusters quickly, and it integrates with other Azure services such as Azure ML and Power BI.

  3. Azure Data Factory - Azure Data Factory is a cloud-based ETL and data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale. Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that ingest data from disparate data stores. You can build complex ETL processes that transform data visually with data flows or by using compute services such as Azure HDInsight Hadoop, Azure Databricks, and Azure SQL Database.

  4. Azure Synapse Analytics - Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated options – at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, transform, manage and serve data for immediate BI and machine learning needs.

  5. Azure Cosmos DB - Cosmos DB is a globally distributed, low-latency, multi-model (key-value, graph, document, column) database for managing data at large scale. It is a cloud-based NoSQL database offered as a PaaS (Platform as a Service) by Microsoft Azure. It is a highly available, high-throughput, reliable database and is often called a serverless database. Cosmos DB grew out of Azure DocumentDB and is available across Azure regions.

Which blocks will go where ?

  1. Ingest Data -> Azure Data Factory : This allows us to lift and shift data to the cloud; there are a few options, such as copying to Blob Storage. Users can create automated pipelines and triggers that execute the pipelines as and when required. A good example use case is raw data arriving on a daily basis: we can create automated triggers to run the pipeline and ingest the data.

  2. Data Storage -> Azure Data Lake : As described above, ADL is a scalable data storage service. Once ADF has ingested the data, we need a place to store it, and ADL is a good place to keep small to large amounts of data. One benefit of ADL is that we can store different types of data such as text, logs, video, documents etc.

  3. Prepare and Transform Data -> Azure Databricks : Once we have ingested the raw data, we need to derive meaningful insights from it, and before that we might have to do some pre-processing such as ETL/ELT and data manipulation. For this purpose Azure Databricks is preferred because of its support for big data tooling such as PySpark and MLlib. ETL/ELT can also be done via Azure Data Factory, so it could be a good fit here as well. However, at this stage we would ideally also want to create visualizations, manipulate and combine data etc., and as such Azure Databricks is the preferred option.

  4. Model and Serve Data : In my opinion we have two blocks in this case, as explained below.

i ) Azure Synapse Analytics : A service offered by Microsoft that provides end-to-end data warehouse capabilities. Synapse is more of a unified environment where we can perform ingestion, exploration, ETL/ELT, and serve data for business intelligence and machine learning needs. In this case, however, Synapse will integrate with the Databricks ETL/ELT output and provide data insights and ML capabilities.

ii) Azure Cosmos DB : Once we have performed all the analytics on the organized data, there might be a requirement that the data be stored in a structured way to serve real-world applications. For example, this could be storing the logs of a deployed web application: a developer might be interested in looking at the logs received to monitor the application. Another use case is inserting a new row of data whenever a sensor reading arrives and alerting the user; the user can then check the sensor readings of the last hour to understand what has happened. Hence Azure Cosmos DB would be a good fit to serve the data to the user in a desirable way.

image.png

Azure Stream Analytics is a fully managed stream processing engine that is designed to analyze and process large volumes of streaming data with sub-millisecond latencies. Patterns and relationships can be identified in data that originates from a variety of input sources including applications, devices, sensors, clickstreams, and social media feeds. These patterns can be used to trigger actions and initiate workflows such as creating alerts, feeding information to a reporting tool, or storing transformed data for later use.

  • An Azure Stream Analytics job consists of an input, query, and an output.
  • The input data can come from IoT devices, web logs, web clicks or any other real-time streaming data.
  • It ingests data from Azure Event Hubs, Azure IoT Hub, or Azure Blob Storage.
  • The query is based on SQL query language and can be used to easily filter, sort, aggregate, and join streaming data.
  • The output can be used to integrate with SQL DB, Cosmos DB, Power BI and other Azure services.

image.png

IOT RESOURCE GROUP DEPLOYED

image.png

image.png

image.png

image.png

PART B

Q1 Problem Statement

Dataset Link - https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29

Dataset Description - The dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites.

Attribute Information -

  1. drugName (categorical): name of drug
  2. condition (categorical): name of condition
  3. review (text): patient review
  4. rating (numerical): 10 star patient rating
  5. date (date): date of review entry
  6. usefulCount (numerical): number of users who found review useful

Problem Statement

  1. To perform EDA and understand general trends in drug usage, conditions suffered by patients etc.

  2. Sentiment Analysis on the reviews

  3. Predict the rating of drug based on the reviews provided.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
In [2]:
df_test=pd.read_csv('drugsComTest_raw.tsv', sep='\t')
df_train=pd.read_csv('drugsComTrain_raw.tsv', sep='\t')
In [3]:
df_train.head()
Out[3]:
Unnamed: 0 drugName condition review rating date usefulCount
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combinati... 9.0 May 20, 2012 27
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of ... 8.0 April 27, 2010 192
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, wh... 5.0 December 14, 2009 17
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth... 8.0 November 3, 2015 10
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around... 9.0 November 27, 2016 37
In [4]:
df_test.head()
Out[4]:
Unnamed: 0 drugName condition review rating date usefulCount
0 163740 Mirtazapine Depression "I've tried a few antidepressants over th... 10.0 February 28, 2012 22
1 206473 Mesalamine Crohn's Disease, Maintenance "My son has Crohn's disease and has done ... 8.0 May 17, 2009 17
2 159672 Bactrim Urinary Tract Infection "Quick reduction of symptoms" 9.0 September 29, 2017 3
3 39293 Contrave Weight Loss "Contrave combines drugs that were used for al... 9.0 March 5, 2017 35
4 97768 Cyclafem 1 / 35 Birth Control "I have been on this birth control for one cyc... 9.0 October 22, 2015 4
In [5]:
df_train.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161297 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   161297 non-null  int64  
 1   drugName     161297 non-null  object 
 2   condition    160398 non-null  object 
 3   review       161297 non-null  object 
 4   rating       161297 non-null  float64
 5   date         161297 non-null  object 
 6   usefulCount  161297 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 8.6+ MB
In [6]:
df_test.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53766 entries, 0 to 53765
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Unnamed: 0   53766 non-null  int64  
 1   drugName     53766 non-null  object 
 2   condition    53471 non-null  object 
 3   review       53766 non-null  object 
 4   rating       53766 non-null  float64
 5   date         53766 non-null  object 
 6   usefulCount  53766 non-null  int64  
dtypes: float64(1), int64(2), object(4)
memory usage: 2.9+ MB

We already see that there are some null values in the condition column. Let us see how many there are and what percentage of rows they make up in both the test and train data.

In [7]:
df_test.describe()
Out[7]:
Unnamed: 0 rating usefulCount
count 53766.000000 53766.000000 53766.000000
mean 116386.701187 6.976900 27.989752
std 67017.739881 3.285207 36.172833
min 0.000000 1.000000 0.000000
25% 58272.500000 4.000000 6.000000
50% 116248.500000 8.000000 16.000000
75% 174586.750000 10.000000 36.000000
max 232284.000000 10.000000 949.000000
In [8]:
df_train.describe()
Out[8]:
Unnamed: 0 rating usefulCount
count 161297.000000 161297.000000 161297.000000
mean 115923.585305 6.994377 28.004755
std 67004.445170 3.272329 36.403742
min 2.000000 1.000000 0.000000
25% 58063.000000 5.000000 6.000000
50% 115744.000000 8.000000 16.000000
75% 173776.000000 10.000000 36.000000
max 232291.000000 10.000000 1291.000000
In [9]:
df_train.isnull().any()
Out[9]:
Unnamed: 0     False
drugName       False
condition       True
review         False
rating         False
date           False
usefulCount    False
dtype: bool
In [10]:
df_test.isnull().any()
Out[10]:
Unnamed: 0     False
drugName       False
condition       True
review         False
rating         False
date           False
usefulCount    False
dtype: bool
In [11]:
df_train['condition'].isnull().sum()
Out[11]:
899
In [12]:
df_test['condition'].isnull().sum()
Out[12]:
295
In [13]:
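# NOTE: these numbers do not match df_train above (899 nulls in 161297 rows, about 0.56%); the resulting ratio is nearly identical.
652/114513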
Out[13]:
0.005693676700461957
In [14]:
295/53766
Out[14]:
0.0054867388312316336
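
The ratios above were typed in by hand. A minimal sketch of computing the same share directly from a frame is shown here; the two small frames are made-up stand-ins for `df_train` and `df_test`, and `null_share` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for df_train / df_test (hypothetical values, not the real data).
toy_train = pd.DataFrame({"condition": ["ADHD", None, "Pain", "Acne"]})
toy_test = pd.DataFrame({"condition": ["Pain", None]})


def null_share(df, column="condition"):
    """Return the fraction of rows whose `column` is null."""
    # The mean of a boolean Series is the share of True values.
    return df[column].isnull().mean()


train_share = null_share(toy_train)
test_share = null_share(toy_test)
print(f"train: {train_share:.2%}, test: {test_share:.2%}")
```

Because the share is derived from the frame itself, it cannot drift out of sync with the row counts reported by `info()`.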

Both datasets contain about 0.5% null values in the condition column, so the combined data also has roughly 0.5% nulls (1,194 of 215,063 rows). We can do the following things :

  1. Drop the null values

  2. Try to fill in the empty values using the historical data present in the dataset by mapping drugs to conditions. However, this might not be useful, as one drug can treat multiple conditions, and choosing a condition at random or the most common one could introduce bias. Moreover, our main goal is to predict the rating from the reviews received. Had it been the other way around, e.g. predicting the condition from the reviews, it would have been a different story.

Before preprocessing, let us perform some EDA by combining both datasets to visualize the data.

In [15]:
data=pd.concat([df_train,df_test])
In [16]:
data.shape
Out[16]:
(215063, 7)

Why merge the data instead of keeping the test set pristine ?

  1. The dataset came already split. I could not find any reason why it was split without pre-processing, even though both parts contain null values, and textual data often requires some cleaning (removing stopwords etc.) to perform better in sentiment analysis and with tf-idf algorithms.

  2. It is best practice to always keep a test set pristine, where no one has looked into it. However, I am afraid that if I clean my train data properly and do not clean the test data equally well, whatever model I build might not be up to the mark. Hence I decided to merge the data and clean it in one go. Merging also lets me analyse all the data records together.
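
One way to clean both parts in one go while still being able to recover the original partition is to tag each row with its origin before stacking. The sketch below uses two made-up frames and a hypothetical `split` column; it is an illustration of the idea, not what the notebook actually ran.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (hypothetical values).
toy_train = pd.DataFrame({"review": ["good", "bad"], "rating": [9.0, 2.0]})
toy_test = pd.DataFrame({"review": ["fine"], "rating": [7.0]})

# Tag each row with its origin, then stack with a fresh 0..n-1 index
# so that every row has a unique label.
combined = pd.concat(
    [toy_train.assign(split="train"), toy_test.assign(split="test")],
    ignore_index=True,
)

# ... shared cleaning steps would be applied to `combined` here ...

# Recover the original partitions after cleaning.
train_again = combined[combined["split"] == "train"].drop(columns="split")
test_again = combined[combined["split"] == "test"].drop(columns="split")
print(combined)
```

With `ignore_index=True` the combined index is unique, which also keeps later label-based operations such as `drop` from touching unintended rows.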

Q2. EDA

In [17]:
data_con=data['condition'].value_counts()
In [18]:
data_con.shape
Out[18]:
(916,)

Chart 1 : Top and bottom 20 conditions present in the dataset.

Since it is not possible to plot all 916 conditions, we will plot the first and last 20 to get some idea.

In [19]:
data_con[0:20].plot.bar(rot=90)
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cc79bb90>
In [20]:
data_con[-20:].plot.bar(rot=90)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cc64a110>

Comments on Chart 1 : Top & Bottom 20 Conditions

  • Birth Control tops the chart with close to 38,000 reviews, followed by Depression (about 12,000).
  • Depression, Pain, Acne, Anxiety and Insomnia are other common conditions. These are conditions we hear about regularly. Note that Pain is a very broad category.
  • Assumption : The general public (non-medical) would easily recognize about 10 of these top-20 conditions, even though this dataset was released in 2018.
  • Straight away we see some noise in the data: condition values containing "</span> users found this comment helpful.". We have to remove these.
  • Assumption : Most of the general public would not have heard of most of the bottom-20 conditions.
  • A casual Google search (e.g. the number of people suffering from Hodgkin's lymphoma, Wilson's disease, or "what is Cogan's syndrome") revealed that most of these are extremely rare conditions.
In [21]:
### Chart 2 - Distribution of Ratings

data_ratings=data['rating'].value_counts()
In [22]:
data_ratings
Out[22]:
10.0    68005
9.0     36708
1.0     28918
8.0     25046
7.0     12547
5.0     10723
2.0      9265
3.0      8718
6.0      8462
4.0      6671
Name: rating, dtype: int64
In [23]:
data_ratings.plot.pie()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cc0f0710>

Comments on Chart 2 : Distribution of ratings

  • Rating 10 tops the chart, followed by rating 9 and rating 1.
  • If we consider a rating > 5 to be positive and a rating < 5 to be negative, we can already see that the dataset has more positive reviews. We will explore the imbalance in the dataset at a later stage.
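
To put a number on that imbalance, the sketch below applies the > 5 / < 5 rule to the rating counts printed in Out[22]. Treating a rating of exactly 5 as "neutral" is an assumption made here, since the rule above does not cover it; `sentiment_label` is a hypothetical helper name.

```python
import pandas as pd

# Rating counts copied from Out[22].
ratings = pd.DataFrame({
    "rating": [10.0, 9.0, 1.0, 8.0, 7.0, 5.0, 2.0, 3.0, 6.0, 4.0],
    "count": [68005, 36708, 28918, 25046, 12547, 10723, 9265, 8718, 8462, 6671],
})


def sentiment_label(rating):
    """Map a 1-10 rating to a coarse label (5 -> 'neutral' is an assumption)."""
    if rating > 5:
        return "positive"
    if rating < 5:
        return "negative"
    return "neutral"


ratings["label"] = ratings["rating"].apply(sentiment_label)
label_counts = ratings.groupby("label")["count"].sum()
label_share = label_counts / label_counts.sum()
print(label_share.round(3))
```

Under this rule roughly 70% of the reviews are positive and about 25% negative, which is worth keeping in mind when choosing evaluation metrics later.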
In [24]:
### Chart 3 Top 20 drugs with rating 10/10 and Top 20 drugs with rating 1/10

top_drugs=data.loc[data.rating==10,"drugName"].value_counts()
In [25]:
top_drugs[:20].plot.bar(rot=90)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cc071550>
In [26]:
bottom_drugs=data.loc[data.rating==1,"drugName"].value_counts()
bottom_drugs[:20].plot.bar(rot=90)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cbfbe550>

Comments on Chart 3 : Top & Bottom drugs with 10/10 rating and 1/10 rating

  • Levonorgestrel, which tops the 10/10 chart, is used for birth control, which in turn tops the conditions. It can be inferred that levonorgestrel was most likely the preferred drug for people taking medication for birth control; however, this drug also makes an appearance among the drugs rated 1.0.
  • Phentermine, second on the list, is used for weight loss and obesity.
  • Most of the other drugs are alternative options for birth control, weight loss etc.
  • Miconazole, which tops the chart for a rating of 1.0, is used for vaginal yeast infection.
  • Some other conditions that make an appearance are birth control, weight loss etc.
  • Quite naturally, what works for some does not necessarily work for others. Some alternatives work well, some do not. Medical drugs are extremely complicated and responses to treatment differ, since a lot of factors influence the treatment.
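
A drug can show up in both charts simply because it has many reviews. One possible way to separate popularity from quality is to look at shares rather than raw counts. The sketch below is an assumption about how that could be done, using a made-up frame with drugs "A" and "B"; it is not part of the original analysis.

```python
import pandas as pd

# Toy reviews (hypothetical): drug A is popular and polarizing, drug B is not.
toy = pd.DataFrame({
    "drugName": ["A", "A", "A", "A", "B", "B"],
    "rating": [10, 10, 1, 1, 10, 8],
})

# Boolean helper columns; their group mean is the share of such ratings.
toy["is_10"] = toy["rating"] == 10
toy["is_1"] = toy["rating"] == 1

summary = toy.groupby("drugName").agg(
    n_reviews=("rating", "size"),
    mean_rating=("rating", "mean"),
    share_10=("is_10", "mean"),
    share_1=("is_1", "mean"),
)
print(summary)
```

On real data one would probably also filter on a minimum `n_reviews`, since shares computed from a handful of reviews are noisy.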
In [27]:
### Plot 4 on time-series data

pd_dt=pd.to_datetime(data['date'])
In [28]:
pd_dt
Out[28]:
0       2012-05-20
1       2010-04-27
2       2009-12-14
3       2015-11-03
4       2016-11-27
           ...    
53761   2014-09-13
53762   2016-10-08
53763   2010-11-15
53764   2011-11-28
53765   2009-09-13
Name: date, Length: 215063, dtype: datetime64[ns]
In [29]:
pd_dt_counts=pd_dt.value_counts()
In [30]:
pd_dt_counts
Out[30]:
2016-03-01    185
2016-03-31    183
2017-01-18    182
2015-12-15    181
2016-03-02    181
             ... 
2008-05-18      5
2017-12-07      5
2008-02-24      5
2008-05-17      4
2008-12-20      4
Name: date, Length: 3579, dtype: int64
In [31]:
#year plot 
yr=pd_dt.dt.year
yr.value_counts().plot.bar(rot=90)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cbf20290>
In [32]:
#month plot
mt=pd_dt.dt.month
mt.value_counts().plot.bar(rot=90)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cbf5d790>
In [33]:
data['year']=yr
data.groupby('year')['condition'].nunique().plot.bar(rot=90)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cbdd3890>
In [34]:
data.groupby('year')['drugName'].nunique().plot.bar(rot=90)
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cbe20750>

Comments on Plot 4 : Time series data

  • 2008 has the fewest reviews; the number of reviews increased over the period and peaked in 2016.

  • Since this dataset was generated by crawling online pharma websites, one can understand that 2008 was the early stage of the internet boom.

  • In the month-wise distribution, December has the fewest reviews, whereas August records the most. All other months have roughly equal counts.

  • The number of conditions alternately rose and fell until 2014, after which it rose until 2017. Does an increase in the number of conditions lead to a rise in drugs as well ?

  • Yes, the number of unique drug names over the years follows the same pattern as that of conditions.
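
The year and month bar charts above are ordered by count, which makes trends over time harder to read. A small sketch of the same counting with an explicit date format and chronological ordering follows; the six dates are taken from the `head()` outputs earlier, and the variable names are illustrative only.

```python
import pandas as pd

# A few date strings in the dataset's format, taken from the head() outputs.
toy_dates = pd.Series([
    "May 20, 2012",
    "April 27, 2010",
    "December 14, 2009",
    "November 3, 2015",
    "November 27, 2016",
    "March 5, 2017",
])

# An explicit format avoids per-row format guessing.
parsed = pd.to_datetime(toy_dates, format="%B %d, %Y")

# sort_index() orders the counts by year/month instead of by frequency.
reviews_per_year = parsed.dt.year.value_counts().sort_index()
reviews_per_month = parsed.dt.month.value_counts().sort_index()
print(reviews_per_year)
print(reviews_per_month)
```

Calling `.plot.bar()` on either result would then draw the bars in calendar order.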

Plot 5 Number of Drugs per condition

In [37]:
data.groupby('condition')['drugName'].nunique().sort_values(ascending=False).head(50)
Out[37]:
condition
Not Listed / Othe                             253
Pain                                          219
Birth Control                                 181
High Blood Pressure                           146
Acne                                          127
Depression                                    115
Rheumatoid Arthritis                          107
Diabetes, Type 2                               97
Allergic Rhinitis                              95
Insomnia                                       85
Osteoarthritis                                 84
Bipolar Disorde                                82
Anxiety                                        81
Abnormal Uterine Bleeding                      77
Endometriosis                                  64
3</span> users found this comment helpful.     62
Psoriasis                                      61
Migraine                                       60
ADHD                                           58
4</span> users found this comment helpful.     57
Asthma, Maintenance                            56
1</span> users found this comment helpful.     54
2</span> users found this comment helpful.     54
Urinary Tract Infection                        53
Chronic Pain                                   53
Irritable Bowel Syndrome                       52
Major Depressive Disorde                       52
6</span> users found this comment helpful.     52
Migraine Prevention                            51
Postmenopausal Symptoms                        47
Bronchitis                                     47
0</span> users found this comment helpful.     47
Bacterial Infection                            46
HIV Infection                                  46
7</span> users found this comment helpful.     46
ibromyalgia                                    46
5</span> users found this comment helpful.     46
GERD                                           45
Dermatitis                                     45
Constipation                                   44
Obesity                                        43
Back Pain                                      43
Headache                                       43
Cough                                          43
Nasal Congestion                               42
Eczema                                         42
Cold Symptoms                                  42
Sinusitis                                      42
Nausea/Vomiting                                40
Multiple Sclerosis                             40
Name: drugName, dtype: int64
In [38]:
data.groupby('condition')['drugName'].nunique().sort_values(ascending=False)[0:20].plot.bar(rot=90)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1cbbcae90>

Comments on Chart 5

  • Apart from the "Not Listed / Othe" placeholder, Pain and Birth Control have the highest number of unique drugs among the conditions.
  • Again we notice a lot of noise in the dataset, which needs to be cleaned.

Q3 [Marks: 15] Do data cleaning/pre-processing as required and explain what you have done for your dataset and why?

In [39]:
pd.set_option('display.max_colwidth', None)

Removing Null values first

In [40]:
data.isnull().any()
Out[40]:
Unnamed: 0     False
drugName       False
condition       True
review         False
rating         False
date           False
usefulCount    False
year           False
dtype: bool
In [41]:
data.shape
Out[41]:
(215063, 8)
In [42]:
# Dropping the data points with null values as it's very much less than 1% of the whole dataset
data = data.dropna(how = 'any', axis = 0)

print ("The shape of the dataset after null values removal :", data.shape)
The shape of the dataset after null values removal : (213869, 8)

Removing the rows with "span" data.

In [43]:
span_data = data[data['condition'].str.contains('</span>',case=False,regex=True) == True]
print('Number of rows with </span> values : ', len(span_data))
noisy_data_ = 100 * (len(span_data)/data.shape[0])
print('Total percent of noisy data {} %  '.format(noisy_data_))
Number of rows with </span> values :  1171
Total percent of noisy data 0.5475314327929715 %  
In [44]:
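# Caution: `data` keeps the original train/test index labels (pd.concat without ignore_index=True),
# so dropping by label also removes rows from the other split that share a label.
# This is why the shape below shrinks by more than the 1171 rows found above.
data.drop(span_data.index, axis = 0, inplace=True)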
In [45]:
data.shape
Out[45]:
(212126, 8)
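
The shape went from 213869 to 212126 rows, i.e. 1743 rows were removed although only 1171 matched the `</span>` pattern. A likely explanation is that the concatenated frame has duplicate index labels, so a label-based `drop` removes every row sharing a label. The sketch below reproduces the effect on two made-up frames and shows a boolean-mask alternative; it is an illustration, not a re-run of the notebook.

```python
import pandas as pd

# Two toy frames whose default indexes overlap (0, 1, 2 in both).
part_a = pd.DataFrame(
    {"condition": ["Pain", "3</span> users found this comment helpful.", "Acne"]}
)
part_b = pd.DataFrame({"condition": ["ADHD", "Gout", "Cough"]})
stacked = pd.concat([part_a, part_b])  # index labels: 0, 1, 2, 0, 1, 2

# Literal substring match; exactly one row is noisy.
noisy = stacked["condition"].str.contains("</span>", regex=False)

# Label-based drop: removes BOTH rows labelled 1 (the noisy one and "Gout").
by_label = stacked.drop(stacked[noisy].index)

# Mask-based filter: removes only the noisy row.
by_mask = stacked[~noisy]

print(len(stacked), len(by_label), len(by_mask))
```

Filtering with the mask (or resetting the index right after `pd.concat`) would keep the valid rows that the label-based drop discards.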

Removing the 'not listed/other conditions'

In [46]:
#check the percentage of 'not listed / othe' conditions
not_listed = data[data['condition'].str.contains('Not Listed / Othe', case=False, regex=True)==True]
print('Number of not_listed values : ', len(not_listed))
percent_not_listed = 100 * len(not_listed)/data.shape[0]
print('Total percent of noisy data {} %  '.format(percent_not_listed))
Number of not_listed values :  590
Total percent of noisy data 0.27813657920292656 %  
In [47]:
#check the percentage of 'not listed / othe' conditions
not_listed = data[data['condition']=='Not Listed / Othe']
print('Number of not_listed values : ', len(not_listed))
percent_not_listed = 100 * len(not_listed)/data.shape[0]
print('Total percent of noisy data {} %  '.format(percent_not_listed))
Number of not_listed values :  590
Total percent of noisy data 0.27813657920292656 %  
In [48]:
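# Caution: index labels are duplicated across the train/test halves of `data`,
# so this label-based drop removes more than the 590 matching rows (see the shape below).
data.drop(not_listed.index, axis = 0, inplace=True)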
In [49]:
data.shape
Out[49]:
(211247, 8)
In [50]:
#quickly checking once again if there are any changes.
data.groupby('condition')['drugName'].nunique().sort_values(ascending=False).head(100)
Out[50]:
condition
Pain                   219
Birth Control          181
High Blood Pressure    146
Acne                   127
Depression             114
                      ... 
Ovarian Cysts           22
Weight Loss             22
Gout, Acute             21
Stomach Ulce            21
Psoriatic Arthritis     21
Name: drugName, Length: 100, dtype: int64
In [51]:
print("Total loss in data is ", (215063-211247)/215063)
Total loss in data is  0.017743637910751734

Textual Data Cleaning

Steps for reviews pre-processing.

  • Remove HTML tags
    • Using BeautifulSoup from the bs4 module. We have already removed the rows matching the pattern "64</span>..."; get_text() will strip any HTML tags that remain in the reviews.
  • Remove Stop Words
    • Remove the stopwords like "a", "the", "I" etc.
  • Remove symbols and special characters
    • We will remove the special characters from our reviews like '#' ,'&' ,'@' etc.
  • Tokenize
    • We will tokenize the words. We will split the sentences with spaces e.g "I might come" --> "I", "might", "come"
  • Stemming
    • Remove the suffixes from the words to get the root form of the word e.g 'Wording' --> "Word"
In [52]:
pip install nltk
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Requirement already satisfied: nltk in /usr/local/lib/python3.7/dist-packages (3.7)
Requirement already satisfied: click in /usr/local/lib/python3.7/dist-packages (from nltk) (7.1.2)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from nltk) (4.64.0)
Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.7/dist-packages (from nltk) (2022.6.2)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from nltk) (1.1.0)
In [53]:
data
Out[53]:
Unnamed: 0 drugName condition review rating date usefulCount year
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9.0 May 20, 2012 27 2012
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective." 8.0 April 27, 2010 192 2010
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5.0 December 14, 2009 17 2009
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8.0 November 3, 2015 10 2015
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9.0 November 27, 2016 37 2016
... ... ... ... ... ... ... ... ...
53761 159999 Tamoxifen Breast Cancer, Prevention "I have taken Tamoxifen for 5 years. Side effects are severe sweating and depression. I have been taking Effexor XR longer than I have been on Tamoxifen. My Oncologist increased the Effexor dosage from 75 mg to 150 mg per day. She assure me the Effexor and Black Cohoosh would STOP the sweating...NOT. SWEATING INCREASED AND I AM MORE DEPRESSED THAN EVER. I had a sonogram last month that revealed a very small fibroid and fluid in my uterus. Got an appointment with GYN next week to see how she wants to handle the uterus problem. " 10.0 September 13, 2014 43 2014
53762 140714 Escitalopram Anxiety "I&#039;ve been taking Lexapro (escitaploprgram) since February. First, I&#039;d like to mention that you can NOT take this drug for a week or less and expect to magically feel better; I felt really sick the first two weeks on this drug. But you HAVE to give the drug time. For me, I didn&#039;t really start noticing the drugs positive effects for about two months. I took Zoloft before this and felt like it made me too tired and absent-minded. Luckily, Lexapro doesn&#039;t seem to have this effect (although I do drink caffeinated drinks). I like Lexapro not only because my anxiety and depression is completely gone, but I feel like I can finally handle everything in my life now (I&#039;m a working full-time college student). I highly recommend this drug." 9.0 October 8, 2016 11 2016
53763 130945 Levonorgestrel Birth Control "I&#039;m married, 34 years old and I have no kids. Taking the pill was such a hassle so I decided to get the Mirena. It was very painful when it was inserted,then had cramping for the rest of that day! For the first 6 weeks I spotted off and on and then my periods just stopped. I still got cramps every few months, but never needed to take anything. The 5th and final year of me having Mirena, I started to spot monthly. I called my OB, they just said that the levonorgestrel wears off over time. I made the decision that I was going to have another one put in. Taking the old one out didn&#039;t hurt at all, you feel a little pressure. Inserting the new one was less painful but still uncomfortable. It&#039;s been 4 days and I still have a little cramping but overall HAPPY." 8.0 November 15, 2010 7 2010
53764 47656 Tapentadol Pain "I was prescribed Nucynta for severe neck/shoulder pain. After taking only 2, 75mg pills I was rushed to the ER with severe breathing problems. I have never had any issues with pain medicines before." 1.0 November 28, 2011 20 2011
53765 113712 Arthrotec Sciatica "It works!!!" 9.0 September 13, 2009 46 2009

211247 rows × 8 columns

In [54]:
#import the libraries for pre-processing
from bs4 import BeautifulSoup
import nltk
nltk.download('stopwords')
import re
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stops = set(stopwords.words('english')) #english stopwords

stemmer = SnowballStemmer('english') #SnowballStemmer

def review_to_words(raw_review):
    # 1. Strip HTML tags and decode HTML entities
    review_text = BeautifulSoup(raw_review, 'html.parser').get_text()
    # 2. Keep only ASCII English letters (digits, punctuation etc. become spaces)
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Lowercase and tokenize on whitespace
    words = letters_only.lower().split()
    # 4. Remove stopwords
    meaningful_words = [w for w in words if w not in stops]
    # 5. Stem each remaining word
    stemming_words = [stemmer.stem(w) for w in meaningful_words]
    # 6. Join the stems back into a single space-separated string
    return ' '.join(stemming_words)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
In [55]:
data['review_clean']=data['review'].apply(review_to_words)
In [56]:
data.head(10)
Out[56]:
Unnamed: 0 drugName condition review rating date usefulCount year review_clean
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9.0 May 20, 2012 27 2012 side effect take combin bystol mg fish oil
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective." 8.0 April 27, 2010 192 2010 son halfway fourth week intuniv becam concern began last week start take highest dose two day could hard get bed cranki slept near hour drive home school vacat unusu call doctor monday morn said stick day see school get morn last two day problem free much agreeabl ever less emot good thing less cranki rememb thing overal behavior better tri mani differ medic far effect
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5.0 December 14, 2009 17 2009 use take anoth oral contracept pill cycl happi light period max day side effect contain hormon gestoden avail us switch lybrel ingredi similar pill end start lybrel immedi first day period instruct said period last two week take second pack two week third pack thing got even wors third period last two week end third week still daili brown discharg posit side side effect idea period free tempt ala
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8.0 November 3, 2015 10 2015 first time use form birth control glad went patch month first decreas libido subsid downsid made period longer day exact use period day max also made cramp intens first two day period never cramp use birth control happi patch
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9.0 November 27, 2016 37 2016 suboxon complet turn life around feel healthier excel job alway money pocket save account none suboxon spent year abus oxycontin paycheck alreadi spent time got start resort scheme steal fund addict histori readi stop good chanc suboxon put path great life found side effect minim compar oxycontin actual sleep better slight constip truli amaz cost pale comparison spent oxycontin
5 155963 Cialis Benign Prostatic Hyperplasia "2nd day on 5mg started to work with rock hard erections however experianced headache, lower bowel preassure. 3rd day erections would wake me up &amp; hurt! Leg/ankles aches severe lower bowel preassure like you need to go #2 but can&#039;t! Enjoyed the initial rockhard erections but not at these side effects or $230 for months supply! I&#039;m 50 &amp; work out 3Xs a week. Not worth side effects!" 2.0 November 28, 2015 43 2015 nd day mg start work rock hard erect howev experianc headach lower bowel preassur rd day erect would wake hurt leg ankl ach sever lower bowel preassur like need go enjoy initi rockhard erect side effect month suppli work xs week worth side effect
6 165907 Levonorgestrel Emergency Contraception "He pulled out, but he cummed a bit in me. I took the Plan B 26 hours later, and took a pregnancy test two weeks later - - I&#039;m pregnant." 1.0 March 7, 2017 5 2017 pull cum bit took plan b hour later took pregnanc test two week later pregnant
7 102654 Aripiprazole Bipolar Disorde "Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured though I know I&#039;m not.. Bi-polar disorder is a constant battle. I know Abilify works for me because I have tried to get off it and lost complete control over my emotions. Went back on it and I was golden again. I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past. Only side effect is I like to eat a lot." 10.0 March 14, 2015 32 2015 abilifi chang life hope zoloft clonidin first start abilifi age zoloft depress clondin manag complet rage mood control depress hopeless one second mean irrat full rage next dr prescrib mg abilifi point feel like cure though know bi polar disord constant battl know abilifi work tri get lost complet control emot went back golden mg x daili better ever past side effect like eat lot
8 74811 Keppra Epilepsy " I Ve had nothing but problems with the Keppera : constant shaking in my arms &amp; legs &amp; pins &amp; needles feeling in my arms &amp; legs severe light headedness no appetite &amp; etc." 1.0 August 9, 2016 11 2016 noth problem keppera constant shake arm leg pin needl feel arm leg sever light headed appetit etc
9 48928 Ethinyl estradiol / levonorgestrel Birth Control "I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger." 8.0 December 8, 2016 1 2016 pill mani year doctor chang rx chateal effect realli help complet clear acn take month though gain extra weight develop emot health issu stop take bc start use natur method birth control start take bc hate acn came back age realli hope symptom like depress weight gain begin affect older also natur moodi may worsen thing negat mental rut today also hope push edg believ depress hope like younger

Comments after cleaning the Textual data

  • Comparing review_clean with review shows what the cleaning has done: HTML entities, digits, punctuation and stopwords are removed, and the remaining words are lowercased and stemmed.
  • Cleaning the text before sentiment analysis and vectorization is important because we want to keep only the distinctive, informative words that help with the classification task.
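
The effect of each cleaning step can be traced on the first review in the table. The sketch below is a simplified stand-in for review_to_words, not the notebook's function: it uses only the standard library, a tiny hand-picked stopword set (an assumption; the notebook uses NLTK's English list) and it skips stemming. The names TINY_STOPS and clean_steps are hypothetical.

```python
import html
import re

# Hypothetical, hand-picked stopword subset; the notebook uses NLTK's full English list.
TINY_STOPS = {"it", "has", "no", "i", "in", "of", "and"}

def clean_steps(raw_review):
    """Return the intermediate result of each cleaning step (no stemming)."""
    unescaped = html.unescape(raw_review)               # decode entities like &#039;
    letters_only = re.sub("[^a-zA-Z]", " ", unescaped)  # digits/punctuation -> spaces
    words = letters_only.lower().split()                # lowercase and tokenize
    kept = [w for w in words if w not in TINY_STOPS]    # drop stopwords
    return unescaped, letters_only, words, kept

raw = "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil"
for step in clean_steps(raw):
    print(step)
```

Apart from stemming (combination becomes combin, bystolic becomes bystol), the last list matches the review_clean value for this row. Note that the negation "no" is dropped as a stopword, so "no side effect" becomes "side effect", which may matter for sentiment scores computed on review_clean.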

Perform Sentiment Analysis

In [58]:
from textblob import TextBlob
In [59]:
sentiment_polarity=[]
for review in data['review_clean']:
  blob=TextBlob(review)
  sentiment_polarity+=[blob.sentiment.polarity]
In [60]:
data['sentiment_polarity']=sentiment_polarity
In [61]:
textblob_dist=pd.DataFrame({'values':[np.sum(data['sentiment_polarity']>0),np.sum(data['sentiment_polarity']<0),np.sum(data['sentiment_polarity']==0)]},index=['positive','negative','neutral'])
In [62]:
textblob_dist.plot.pie(y='values')
print(textblob_dist)
          values
positive  133186
negative   56786
neutral    21275
In [63]:
data.head(10)
Out[63]:
Unnamed: 0 drugName condition review rating date usefulCount year review_clean sentiment_polarity
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9.0 May 20, 2012 27 2012 side effect take combin bystol mg fish oil 0.000000
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective." 8.0 April 27, 2010 192 2010 son halfway fourth week intuniv becam concern began last week start take highest dose two day could hard get bed cranki slept near hour drive home school vacat unusu call doctor monday morn said stick day see school get morn last two day problem free much agreeabl ever less emot good thing less cranki rememb thing overal behavior better tri mani differ medic far effect 0.114583
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5.0 December 14, 2009 17 2009 use take anoth oral contracept pill cycl happi light period max day side effect contain hormon gestoden avail us switch lybrel ingredi similar pill end start lybrel immedi first day period instruct said period last two week take second pack two week third pack thing got even wors third period last two week end third week still daili brown discharg posit side side effect idea period free tempt ala 0.105000
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8.0 November 3, 2015 10 2015 first time use form birth control glad went patch month first decreas libido subsid downsid made period longer day exact use period day max also made cramp intens first two day period never cramp use birth control happi patch 0.300000
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9.0 November 27, 2016 37 2016 suboxon complet turn life around feel healthier excel job alway money pocket save account none suboxon spent year abus oxycontin paycheck alreadi spent time got start resort scheme steal fund addict histori readi stop good chanc suboxon put path great life found side effect minim compar oxycontin actual sleep better slight constip truli amaz cost pale comparison spent oxycontin 0.147037
5 155963 Cialis Benign Prostatic Hyperplasia "2nd day on 5mg started to work with rock hard erections however experianced headache, lower bowel preassure. 3rd day erections would wake me up &amp; hurt! Leg/ankles aches severe lower bowel preassure like you need to go #2 but can&#039;t! Enjoyed the initial rockhard erections but not at these side effects or $230 for months supply! I&#039;m 50 &amp; work out 3Xs a week. Not worth side effects!" 2.0 November 28, 2015 43 2015 nd day mg start work rock hard erect howev experianc headach lower bowel preassur rd day erect would wake hurt leg ankl ach sever lower bowel preassur like need go enjoy initi rockhard erect side effect month suppli work xs week worth side effect 0.136111
6 165907 Levonorgestrel Emergency Contraception "He pulled out, but he cummed a bit in me. I took the Plan B 26 hours later, and took a pregnancy test two weeks later - - I&#039;m pregnant." 1.0 March 7, 2017 5 2017 pull cum bit took plan b hour later took pregnanc test two week later pregnant 0.111111
7 102654 Aripiprazole Bipolar Disorde "Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured though I know I&#039;m not.. Bi-polar disorder is a constant battle. I know Abilify works for me because I have tried to get off it and lost complete control over my emotions. Went back on it and I was golden again. I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past. Only side effect is I like to eat a lot." 10.0 March 14, 2015 32 2015 abilifi chang life hope zoloft clonidin first start abilifi age zoloft depress clondin manag complet rage mood control depress hopeless one second mean irrat full rage next dr prescrib mg abilifi point feel like cure though know bi polar disord constant battl know abilifi work tri get lost complet control emot went back golden mg x daili better ever past side effect like eat lot 0.047756
8 74811 Keppra Epilepsy " I Ve had nothing but problems with the Keppera : constant shaking in my arms &amp; legs &amp; pins &amp; needles feeling in my arms &amp; legs severe light headedness no appetite &amp; etc." 1.0 August 9, 2016 11 2016 noth problem keppera constant shake arm leg pin needl feel arm leg sever light headed appetit etc 0.200000
9 48928 Ethinyl estradiol / levonorgestrel Birth Control "I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger." 8.0 December 8, 2016 1 2016 pill mani year doctor chang rx chateal effect realli help complet clear acn take month though gain extra weight develop emot health issu stop take bc start use natur method birth control start take bc hate acn came back age realli hope symptom like depress weight gain begin affect older also natur moodi may worsen thing negat mental rut today also hope push edg believ depress hope like younger -0.085185

Compare the results with Vader

In [64]:
pip install vadersentiment
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vadersentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     |████████████████████████████████| 125 kB 27.9 MB/s 
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from vadersentiment) (2.23.0)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->vadersentiment) (1.24.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->vadersentiment) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->vadersentiment) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->vadersentiment) (2022.6.15)
Installing collected packages: vadersentiment
Successfully installed vadersentiment-3.3.2
In [65]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
v_intensity=SentimentIntensityAnalyzer()
v_scores=[]
for review in data['review_clean']:
  compound_value=v_intensity.polarity_scores(review)
  v_scores.append(compound_value['compound'])
data['vader_intensity']=v_scores

data.head(10)
Out[65]:
Unnamed: 0 drugName condition review rating date usefulCount year review_clean sentiment_polarity vader_intensity
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9.0 May 20, 2012 27 2012 side effect take combin bystol mg fish oil 0.000000 0.0000
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective." 8.0 April 27, 2010 192 2010 son halfway fourth week intuniv becam concern began last week start take highest dose two day could hard get bed cranki slept near hour drive home school vacat unusu call doctor monday morn said stick day see school get morn last two day problem free much agreeabl ever less emot good thing less cranki rememb thing overal behavior better tri mani differ medic far effect 0.114583 0.6929
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5.0 December 14, 2009 17 2009 use take anoth oral contracept pill cycl happi light period max day side effect contain hormon gestoden avail us switch lybrel ingredi similar pill end start lybrel immedi first day period instruct said period last two week take second pack two week third pack thing got even wors third period last two week end third week still daili brown discharg posit side side effect idea period free tempt ala 0.105000 0.5106
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8.0 November 3, 2015 10 2015 first time use form birth control glad went patch month first decreas libido subsid downsid made period longer day exact use period day max also made cramp intens first two day period never cramp use birth control happi patch 0.300000 0.4199
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9.0 November 27, 2016 37 2016 suboxon complet turn life around feel healthier excel job alway money pocket save account none suboxon spent year abus oxycontin paycheck alreadi spent time got start resort scheme steal fund addict histori readi stop good chanc suboxon put path great life found side effect minim compar oxycontin actual sleep better slight constip truli amaz cost pale comparison spent oxycontin 0.147037 0.8934
5 155963 Cialis Benign Prostatic Hyperplasia "2nd day on 5mg started to work with rock hard erections however experianced headache, lower bowel preassure. 3rd day erections would wake me up &amp; hurt! Leg/ankles aches severe lower bowel preassure like you need to go #2 but can&#039;t! Enjoyed the initial rockhard erections but not at these side effects or $230 for months supply! I&#039;m 50 &amp; work out 3Xs a week. Not worth side effects!" 2.0 November 28, 2015 43 2015 nd day mg start work rock hard erect howev experianc headach lower bowel preassur rd day erect would wake hurt leg ankl ach sever lower bowel preassur like need go enjoy initi rockhard erect side effect month suppli work xs week worth side effect 0.136111 -0.1531
6 165907 Levonorgestrel Emergency Contraception "He pulled out, but he cummed a bit in me. I took the Plan B 26 hours later, and took a pregnancy test two weeks later - - I&#039;m pregnant." 1.0 March 7, 2017 5 2017 pull cum bit took plan b hour later took pregnanc test two week later pregnant 0.111111 0.0000
7 102654 Aripiprazole Bipolar Disorde "Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured though I know I&#039;m not.. Bi-polar disorder is a constant battle. I know Abilify works for me because I have tried to get off it and lost complete control over my emotions. Went back on it and I was golden again. I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past. Only side effect is I like to eat a lot." 10.0 March 14, 2015 32 2015 abilifi chang life hope zoloft clonidin first start abilifi age zoloft depress clondin manag complet rage mood control depress hopeless one second mean irrat full rage next dr prescrib mg abilifi point feel like cure though know bi polar disord constant battl know abilifi work tri get lost complet control emot went back golden mg x daili better ever past side effect like eat lot 0.047756 -0.8442
8 74811 Keppra Epilepsy " I Ve had nothing but problems with the Keppera : constant shaking in my arms &amp; legs &amp; pins &amp; needles feeling in my arms &amp; legs severe light headedness no appetite &amp; etc." 1.0 August 9, 2016 11 2016 noth problem keppera constant shake arm leg pin needl feel arm leg sever light headed appetit etc 0.200000 -0.5267
9 48928 Ethinyl estradiol / levonorgestrel Birth Control "I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger." 8.0 December 8, 2016 1 2016 pill mani year doctor chang rx chateal effect realli help complet clear acn take month though gain extra weight develop emot health issu stop take bc start use natur method birth control start take bc hate acn came back age realli hope symptom like depress weight gain begin affect older also natur moodi may worsen thing negat mental rut today also hope push edg believ depress hope like younger -0.085185 0.8481
In [66]:
sent_dist=pd.DataFrame({'values':[np.sum(data['vader_intensity']>0),np.sum(data['vader_intensity']<0),np.sum(data['vader_intensity']==0)]},index=['positive','negative','neutral'])
sent_dist.plot.pie(y='values')
print(sent_dist)
          values
positive  111558
negative   89143
neutral    10546
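
The two printed distributions are easier to compare as percentages. The following is a minimal sketch that reuses only the counts printed above; the helper name to_percent is hypothetical.

```python
# Counts copied from the two printed distributions above.
counts = {
    "textblob": {"positive": 133186, "negative": 56786, "neutral": 21275},
    "vader": {"positive": 111558, "negative": 89143, "neutral": 10546},
}

def to_percent(dist):
    """Convert raw class counts to percentages of the total."""
    total = sum(dist.values())
    return {label: 100.0 * n / total for label, n in dist.items()}

shares = {tool: to_percent(dist) for tool, dist in counts.items()}
for tool, dist in shares.items():
    print(tool, {label: round(p, 1) for label, p in dist.items()})
```

Both totals equal the 211,247 rows of the dataset. Roughly 42% of the reviews are negative under Vader, against roughly 27% under TextBlob.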

Vader sentiment analysis classifies more reviews as negative than TextBlob does (89,143 versus 56,786), and in general practice Vader has been found to handle negative polarity much better than TextBlob. Let us further analyse how the two behave on some general medical review sentences.

In [67]:
print(v_intensity.polarity_scores("Taking this medicine had side effects such as headache"))
print(TextBlob("Taking this medicine had side effects such as headache").sentiment.polarity)
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
0.0

This sentence is hard to interpret even from a human perspective:

  1. In reality, a medicine may have some side effects irrespective of the person, age, etc., so this behaviour is expected.

  2. Having side effects from a medicine is usually viewed negatively: the patient is informed of these effects beforehand, and people do not want side effects from a medication.

Overall, in my opinion, the side effects of a medicine should be treated as negative. However, both tools return a neutral score.

In [68]:
print(v_intensity.polarity_scores("Taking this medicine cured my illness"))
print(TextBlob("Taking this medicine cured my illness").sentiment.polarity)
{'neg': 0.351, 'neu': 0.649, 'pos': 0.0, 'compound': -0.4019}
0.0

This is where the real problem arises: "cured my illness" is positive, yet Vader gives more weight to the word "illness" and assigns a negative score.

TextBlob performs somewhat better because it does not go negative, but its neutral score of 0.0 is still not good.

In [69]:
print(v_intensity.polarity_scores("I have Diarrhea and taking this medicine had no effect"))
print(TextBlob("I have Diarrhea and taking this medicine had no effect").sentiment.polarity)
{'neg': 0.196, 'neu': 0.804, 'pos': 0.0, 'compound': -0.296}
0.0

A neutral sentence where TextBlob performed better.

In [70]:
print(v_intensity.polarity_scores("I have Diarrhea and taking this medicine made it more worse"))
print(TextBlob("I have Diarrhea and taking this medicine made it more worse").sentiment.polarity)
{'neg': 0.253, 'neu': 0.747, 'pos': 0.0, 'compound': -0.5256}
0.04999999999999999

A negative review where Vader performed better.

In [71]:
print(v_intensity.polarity_scores("After taking Levonorgestrel, I had migraine and vomitigs"))
print(TextBlob("After taking Levonorgestrel, I had migraine and vomitigs").sentiment.polarity)
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
0.0

These are not reliable results. Let us modify the sentence slightly and see what we get.

In [72]:
print(v_intensity.polarity_scores("After taking Levonorgestrel, I had migraine and vomitigs hence I would not recommend it to anyone suffering from cough"))
print(TextBlob("After taking Levonorgestrel,I had migraine and vomitigs hence I would not recommend it to anyone suffering from cough").sentiment.polarity)
{'neg': 0.235, 'neu': 0.765, 'pos': 0.0, 'compound': -0.6381}
0.0

Vader outperforms TextBlob here and correctly assigns a negative score.

In [73]:
text="""Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin."""
In [74]:
print(v_intensity.polarity_scores(text))
print(TextBlob(text).sentiment.polarity)
{'neg': 0.061, 'neu': 0.771, 'pos': 0.168, 'compound': 0.9403}
0.19444444444444445

Vader wins in the above example. This example uses the raw review without any pre-processing; on the pre-processed text the Vader score is 0.89.

Comments on Sentiment Analysis

Naturally, we expected reviews with a rating above 7 or 8 to be strongly positive, those with a rating of 4-5 to be neutral, and those below 4 to be strongly negative.

Some things to note here are :

  1. We are dealing with medical text, and in this setting both Vader and TextBlob may lack predefined rules for medical terms. For example, medical conditions such as acne, rashes, or shaking of arms and legs, as well as drug names and weight gain, might not be covered well by these libraries. While a human reader might understand the sentiment of a review, it is hard to depend on results based on generic words only.

  2. TextBlob and Vader are geared towards general-purpose text such as Twitter posts and movie reviews.

  3. We cleaned the sentences to reduce the number of words, as sentiment analysis also works with fewer words.

Final Conclusion on Sentiment Analysis

  1. Vader performs better than TextBlob overall.
  2. However, both are highly unreliable here, so we will not use the sentiment scores to create our target variable. We will use the actual ratings to create the target column.
  3. A custom analyzer that includes medical terms, medical conditions, side effects, etc. should be used for best results.
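
One way to picture the custom analyzer suggested in point 3 is a general-purpose lexicon extended with domain terms. The sketch below is a toy word-averaging scorer, not the algorithm used by Vader or TextBlob, and every word weight in it is a made-up illustrative value.

```python
# Toy lexicon-based scorer. All weights are hypothetical illustration values,
# not taken from Vader, TextBlob, or any medical resource.
GENERAL_LEXICON = {"good": 0.6, "great": 0.8, "bad": -0.6, "worse": -0.7}

# Hypothetical domain additions: symptoms and side effects carry negative weight.
MEDICAL_LEXICON = {"migraine": -0.6, "vomiting": -0.7, "diarrhea": -0.6, "rash": -0.5}


def score(text, lexicon):
    """Average the weights of the known words; 0.0 if no word is known."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    weights = [lexicon[w] for w in words if w in lexicon]
    return sum(weights) / len(weights) if weights else 0.0


review = "After taking the drug, I had migraine and vomiting"
general_score = score(review, GENERAL_LEXICON)
medical_score = score(review, {**GENERAL_LEXICON, **MEDICAL_LEXICON})

print(general_score)  # no known words, so the review looks neutral
print(medical_score)  # negative once the symptom words are in the lexicon
```

This mirrors what happened with the Levonorgestrel sentence earlier: with no symptom words in the lexicon, the review scores as neutral.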

Creating our Target Column

We are going to use a threshold rating of 5 to assign the class: a review has a positive sentiment (1) if rating > 5 and a negative sentiment (0) otherwise.

In [75]:
data['rating_label'] = data['rating'].apply(lambda x: 1 if x > 5 else 0)
data.head(10)
Out[75]:
Unnamed: 0 drugName condition review rating date usefulCount year review_clean sentiment_polarity vader_intensity rating_label
0 206461 Valsartan Left Ventricular Dysfunction "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" 9.0 May 20, 2012 27 2012 side effect take combin bystol mg fish oil 0.000000 0.0000 1
1 95260 Guanfacine ADHD "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. \r\nWe have tried many different medications and so far this is the most effective." 8.0 April 27, 2010 192 2010 son halfway fourth week intuniv becam concern began last week start take highest dose two day could hard get bed cranki slept near hour drive home school vacat unusu call doctor monday morn said stick day see school get morn last two day problem free much agreeabl ever less emot good thing less cranki rememb thing overal behavior better tri mani differ medic far effect 0.114583 0.6929 1
2 92703 Lybrel Birth Control "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it&#039;s the end of the third week- I still have daily brown discharge.\r\nThe positive side is that I didn&#039;t have any other side effects. The idea of being period free was so tempting... Alas." 5.0 December 14, 2009 17 2009 use take anoth oral contracept pill cycl happi light period max day side effect contain hormon gestoden avail us switch lybrel ingredi similar pill end start lybrel immedi first day period instruct said period last two week take second pack two week third pack thing got even wors third period last two week end third week still daili brown discharg posit side side effect idea period free tempt ala 0.105000 0.5106 0
3 138000 Ortho Evra Birth Control "This is my first time using any form of birth control. I&#039;m glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" 8.0 November 3, 2015 10 2015 first time use form birth control glad went patch month first decreas libido subsid downsid made period longer day exact use period day max also made cramp intens first two day period never cramp use birth control happi patch 0.300000 0.4199 1
4 35696 Buprenorphine / naloxone Opiate Dependence "Suboxone has completely turned my life around. I feel healthier, I&#039;m excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you&#039;re ready to stop, there&#039;s a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I&#039;m actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." 9.0 November 27, 2016 37 2016 suboxon complet turn life around feel healthier excel job alway money pocket save account none suboxon spent year abus oxycontin paycheck alreadi spent time got start resort scheme steal fund addict histori readi stop good chanc suboxon put path great life found side effect minim compar oxycontin actual sleep better slight constip truli amaz cost pale comparison spent oxycontin 0.147037 0.8934 1
5 155963 Cialis Benign Prostatic Hyperplasia "2nd day on 5mg started to work with rock hard erections however experianced headache, lower bowel preassure. 3rd day erections would wake me up &amp; hurt! Leg/ankles aches severe lower bowel preassure like you need to go #2 but can&#039;t! Enjoyed the initial rockhard erections but not at these side effects or $230 for months supply! I&#039;m 50 &amp; work out 3Xs a week. Not worth side effects!" 2.0 November 28, 2015 43 2015 nd day mg start work rock hard erect howev experianc headach lower bowel preassur rd day erect would wake hurt leg ankl ach sever lower bowel preassur like need go enjoy initi rockhard erect side effect month suppli work xs week worth side effect 0.136111 -0.1531 0
6 165907 Levonorgestrel Emergency Contraception "He pulled out, but he cummed a bit in me. I took the Plan B 26 hours later, and took a pregnancy test two weeks later - - I&#039;m pregnant." 1.0 March 7, 2017 5 2017 pull cum bit took plan b hour later took pregnanc test two week later pregnant 0.111111 0.0000 0
7 102654 Aripiprazole Bipolar Disorde "Abilify changed my life. There is hope. I was on Zoloft and Clonidine when I first started Abilify at the age of 15.. Zoloft for depression and Clondine to manage my complete rage. My moods were out of control. I was depressed and hopeless one second and then mean, irrational, and full of rage the next. My Dr. prescribed me 2mg of Abilify and from that point on I feel like I have been cured though I know I&#039;m not.. Bi-polar disorder is a constant battle. I know Abilify works for me because I have tried to get off it and lost complete control over my emotions. Went back on it and I was golden again. I am on 5mg 2x daily. I am now 21 and better than I have ever been in the past. Only side effect is I like to eat a lot." 10.0 March 14, 2015 32 2015 abilifi chang life hope zoloft clonidin first start abilifi age zoloft depress clondin manag complet rage mood control depress hopeless one second mean irrat full rage next dr prescrib mg abilifi point feel like cure though know bi polar disord constant battl know abilifi work tri get lost complet control emot went back golden mg x daili better ever past side effect like eat lot 0.047756 -0.8442 1
8 74811 Keppra Epilepsy " I Ve had nothing but problems with the Keppera : constant shaking in my arms &amp; legs &amp; pins &amp; needles feeling in my arms &amp; legs severe light headedness no appetite &amp; etc." 1.0 August 9, 2016 11 2016 noth problem keppera constant shake arm leg pin needl feel arm leg sever light headed appetit etc 0.200000 -0.5267 0
9 48928 Ethinyl estradiol / levonorgestrel Birth Control "I had been on the pill for many years. When my doctor changed my RX to chateal, it was as effective. It really did help me by completely clearing my acne, this takes about 6 months though. I did not gain extra weight, or develop any emotional health issues. I stopped taking it bc I started using a more natural method of birth control, but started to take it bc I hate that my acne came back at age 28. I really hope symptoms like depression, or weight gain do not begin to affect me as I am older now. I&#039;m also naturally moody, so this may worsen things. I was in a negative mental rut today. Also I hope this doesn&#039;t push me over the edge, as I believe I am depressed. Hopefully it&#039;ll be just like when I was younger." 8.0 December 8, 2016 1 2016 pill mani year doctor chang rx chateal effect realli help complet clear acn take month though gain extra weight develop emot health issu stop take bc start use natur method birth control start take bc hate acn came back age realli hope symptom like depress weight gain begin affect older also natur moodi may worsen thing negat mental rut today also hope push edg believ depress hope like younger -0.085185 0.8481 1

Checking on class imbalance

In [76]:
data['rating_label'].value_counts().plot.pie()
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1c0ae4e50>
In [77]:
data['rating_label'].value_counts()
Out[77]:
1    148071
0     63176
Name: rating_label, dtype: int64
In [78]:
print("% of class 1 labels are ",148071/data.shape[0])
print("% of class 0 labels are",63176/data.shape[0])
% of class 1 labels are  0.7009377647966598
% of class 0 labels are 0.2990622352033402

We find that the dataset is imbalanced with a 70:30 ratio. Even though we could run ML models on it and get reasonable results, it should still be treated as a class imbalance problem, so we proceed to downsample the majority class.
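
To see why the imbalance matters, it helps to compute the accuracy of a trivial classifier that always predicts the majority class. The short sketch below uses only the class counts printed above; any model on the unbalanced data should be judged against this baseline rather than against 50%.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 148071, 0: 63176}
total = sum(counts.values())

# A classifier that always predicts the majority class is right
# exactly as often as the majority class occurs.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

print(majority_class)
print(round(baseline_accuracy, 4))
```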

Downsampling the dataset to make equal distribution

In [79]:
class_1=data[data['rating_label']==1]
class_0=data[data['rating_label']==0]
print(class_1.shape)
print(class_0.shape)
(148071, 12)
(63176, 12)
In [80]:
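from sklearn.utils import resample
# Note: replace=True samples with replacement, so the downsampled class may contain duplicate rows;
# replace=False is the usual choice when downsampling.
class_1_downsample=resample(class_1,replace=True,n_samples=class_0.shape[0],random_state=42)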
print(class_1_downsample.shape)
(63176, 12)
In [81]:
class_1_downsample
Out[81]:
Unnamed: 0 drugName condition review rating date usefulCount year review_clean sentiment_polarity vader_intensity rating_label
15590 106834 Implanon Birth Control "I&#039;ve been on Implanon since April 2013...for the entire first year I had absolutely no period which was great! About 2 months ago I got a very heavy period. I&#039;ve been having light bleeding ever since and it&#039;s driving me nuts, but overall this is the best method of birth control for me. And I&#039;ve tried everything! Side effects are different for everyone and it can always be taken out if it doesn&#039;t work for you. I&#039;d say if you&#039;re forgetful or just prefer a form of birth control that requires no effort don&#039;t hesitate to give this a try I love it!" 7.0 December 3, 2014 2 2014 implanon sinc april entir first year absolut period great month ago got heavi period light bleed ever sinc drive nut overal best method birth control tri everyth side effect differ everyon alway taken work say forget prefer form birth control requir effort hesit give tri love 0.590000 0.9118 1
52010 65311 Yasmin Acne "It took six months to my skin to get clear. The first 4 months were the worst, but now, I don" 10.0 August 13, 2015 5 2015 took six month skin get clear first month worst -0.216667 -0.3612 1
30180 56588 Elavil Anxiety and Stress "This medication has been very effective for my anxiety. I&#039;ve taken 75mg for the past 9yrs. No more heart palpitations or tightness in my chest. I have tried others prior to the Elavil, but found I woke up with a hangover and a out of body feeling. If Ii forget to order refill before I run out it takes a 3 days to feel the effects. " 10.0 September 23, 2014 160 2014 medic effect anxieti taken mg past yrs heart palpit tight chest tri other prior elavil found woke hangov bodi feel ii forget order refil run take day feel effect -0.138393 0.5106 1
150187 225640 Bupropion Smoking Cessation "After 25 years of heavy smoking I quit in five days! (Though I enjoyed smoking so much!) This happened 13 years ago. Since then - no cravings. My husband followed my example: the same result seven days later." 10.0 August 9, 2013 91 2013 year heavi smoke quit five day though enjoy smoke much happen year ago sinc crave husband follow exampl result seven day later 0.200000 0.4939 1
12568 50313 Gabapentin Anxiety "I am 47 with such an exhausting history of anxiety that had led me to self medicate with alcohol, until finally finding the correct medical doctor.\r\nDoctors in the past have been writing me scrips that have been making me crazier and crazier, just adding to my problem. Now I am alcohol free with no panic attacks, I have no problems being by myself and keeping occupied with work and fun things to do. I sleep like a baby all through the night. It&#039;s wonderful." 10.0 February 28, 2014 160 2014 exhaust histori anxieti led self medic alcohol final find correct medic doctor doctor past write scrip make crazier crazier ad problem alcohol free panic attack problem keep occupi work fun thing sleep like babi night wonder 0.112500 -0.6249 1
... ... ... ... ... ... ... ... ... ... ... ... ...
112174 65828 Propranolol Anxiety "I have been suffering from anxiety all my life and I was prescribed propranolol 20 mg by my doctor in November to help me get through a rough time right now (working, grad school, internship). I was getting panic attacks everyday by the stress and it is amazing how well it works! I love how I get no side effects! I take it once a day in the morning and it&#039;s been working wonderfully! I just need 1 to help me get through my whole day without any physical nervous symptoms and it actually is improving my mood; so much more confident and happy which I guess is because I know anxiety will not take over my life anymore!!!" 10.0 January 15, 2016 50 2016 suffer anxieti life prescrib propranolol mg doctor novemb help get rough time right work grad school internship get panic attack everyday stress amaz well work love get side effect take day morn work wonder need help get whole day without physic nervous symptom actual improv mood much confid happi guess know anxieti take life anymor 0.126531 -0.0480 1
55383 14473 Buprenorphine / naloxone Opiate Dependence "This is the wonder drug. I was on Vicodin and oxycodone for over 6 years from back surgery. I found a doctor who helped me get off them and give me life again. It saved me from losing my family and my friends. Suboxone starting working immediately after the first dose. Instantly I was feeling better about my life. If in doubt don&#039;t be, admit you have a problem and get some help. Read all the above comments, they are true." 10.0 April 6, 2009 12 2009 wonder drug vicodin oxycodon year back surgeri found doctor help get give life save lose famili friend suboxon start work immedi first dose instant feel better life doubt admit problem get help read comment true 0.220000 0.8860 1
24399 183640 Cymbalta Anxiety "I suffer from medical anxiety. I first tried Paxil - itched like crazy. Then lexapro - shook all the time. Cymbalta has changed my life. I feel so much better now, and almost back to &#039;normal&#039; after 5 weeks on the medication. I have a little insomnia, and have lost a few pounds, but other than that I feel like I&#039;m 30 again, instead of 50!" 10.0 September 28, 2009 57 2009 suffer medic anxieti first tri paxil itch like crazi lexapro shook time cymbalta chang life feel much better almost back normal week medic littl insomnia lost pound feel like instead 0.225000 0.1779 1
95549 93643 Morphine Chronic Pain "After years of just plain Percocet 10/325, Ms-Contin is a miracle. I can actually feel like a human being again. I&#039;ve had C3, 4 &amp; 5 fused. C1 &amp; 2 are in bad shape and they are scared to perform any more surgery. My left rotator cuff is &quot;gone&quot; and irreparable. Rothman Institute said it will only get worse, nothing more can be done. Right rotator cuff is not as bad, still have 2 surviving (barely) tendons. Both knees are destroyed and will soon need replaced. Ms-Contin with 10mg Oxycodone 3 x daily for breakthrough pain have given me my life back again, (as much as possible.)" 9.0 March 30, 2013 75 2013 year plain percocet ms contin miracl actual feel like human c fuse c bad shape scare perform surgeri left rotat cuff gone irrepar rothman institut said get wors noth done right rotat cuff bad still surviv bare tendon knee destroy soon need replac ms contin mg oxycodon x daili breakthrough pain given life back much possibl -0.116234 -0.9382 1
30130 112665 Lunesta Insomnia "I have been on Lunesta, 2mg, for almost a year now and by far, this is the best sleep medication out there. Although, with this medicine I have had little to no side effects at all. Only thing is, I have had sleep walking occurrences but overall, this is an excellent sleep aid and pricey too." 10.0 April 29, 2011 8 2011 lunesta mg almost year far best sleep medic although medicin littl side effect thing sleep walk occurr overal excel sleep aid pricey 0.550000 0.7867 1

63176 rows × 12 columns

In [82]:
data_downsampled=pd.concat([class_1_downsample,class_0])
print(data_downsampled.shape)
print(data_downsampled['rating_label'].value_counts())
data_downsampled['rating_label'].value_counts().plot.pie()
(126352, 12)
1    63176
0    63176
Name: rating_label, dtype: int64
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc1c1d92d10>

Comments on class imbalance

We have downsampled the majority class to get an equal distribution of data (50:50). We will evaluate the performance on both datasets (balanced and unbalanced) in Q4 and check the results.
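
A side note on the sampling itself: the `resample` call above uses `replace=True`, which draws with replacement, so the same review can appear more than once in the downsampled class. The sketch below illustrates the difference with the standard library only; the row ids and sizes are stand-in values, not the notebook's data.

```python
import random

rng = random.Random(42)
majority_ids = list(range(1000))   # stand-in for the majority-class row ids
n_minority = 300                   # stand-in for the minority-class size

# With replacement: the same row can be drawn more than once.
with_replacement = rng.choices(majority_ids, k=n_minority)

# Without replacement: every drawn row is distinct.
without_replacement = rng.sample(majority_ids, k=n_minority)

print(len(set(with_replacement)), "distinct rows out of", n_minority)
print(len(set(without_replacement)), "distinct rows out of", n_minority)
```

With `sklearn.utils.resample`, passing `replace=False` gives the second behaviour.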

Summary of Data Pre-processing done in Q3.

  1. We cleaned the textual data to remove unnecessary common words, punctuation, etc.

  2. We performed sentiment analysis to understand the nature and distribution of the reviews.

  3. Created our target variable based on a rating threshold of 5 for binary classification.

  4. Checked the dataset for imbalance and found a ratio of 70:30 (class 1 : class 0).

  5. Downsampled the majority class to a 50:50 split to remove the class imbalance.

We now have 2 datasets, let us now move to Q4 and build the models.

Q4 Implement 2 machine learning models, explain which algorithms you have selected and why. Compare them and show success metrics (Accuracy/RMSE/Confusion Matrix) as per your problem. Explain results.

We will primarily use two algorithms:

  1. Random Forest - An ensemble classifier built from a number of decision trees. It tends to work well on many of the classification problems seen in practice.

  2. Naive Bayes - i) quite fast, ii) known to work well with textual data. Though the multinomial variant is designed for integer count features, in practice it can also work with tf-idf.

We could also run other supervised classifiers such as KNN, SVM/SVC, or Logistic Regression. KNN is of course impractical for such a large dataset with so many features.

Since we cannot feed raw textual data we will use tf-idf vectorizer to convert our textual data to vectorized format.
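
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF weights by hand for a toy corpus. It follows the smoothed IDF formula and L2 normalisation documented for scikit-learn's default settings, but it splits on whitespace rather than using the library's tokenizer, so treat it as an approximation of the idea rather than a reimplementation.

```python
import math
from collections import Counter

docs = ["side effect mild", "side effect sever headach", "no effect"]  # toy corpus
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in tokenized for term in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in df}


def tfidf_row(doc):
    """Raw term counts times IDF, then L2-normalised."""
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print({term: round(w, 3) for term, w in row.items()})
```

A term that occurs in every document ("effect" here) gets the minimum IDF of 1, while rarer terms are weighted up.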

In [83]:
from sklearn.model_selection import train_test_split #import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer #import TfidfVectorizer 
from sklearn.metrics import confusion_matrix #import confusion_matrix
from sklearn.naive_bayes import MultinomialNB #import MultinomialNB
from sklearn.ensemble import RandomForestClassifier  #import RandomForestClassifier
In [84]:
# Creates TF-IDF vectorizer and transforms the corpus
vectorizer = TfidfVectorizer()
reviews_corpus = vectorizer.fit_transform(data.review_clean)
reviews_corpus.shape
Out[84]:
(211247, 34579)
In [85]:
# check what our sparse matrix looks like
reviews_corpus[0:100].toarray()
Out[85]:
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
In [86]:
#not of relevance but just to check that all elements are not zero
np.max(reviews_corpus[0:100].toarray())
Out[86]:
0.7641746820118535
In [87]:
sentiment_data=data['rating_label']
sentiment_data.shape
Out[87]:
(211247,)
In [88]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(reviews_corpus,sentiment_data,test_size=0.30)
print('Train data shape ',X_train.shape,Y_train.shape)
print('Test data shape ',X_test.shape,Y_test.shape)
Train data shape  (147872, 34579) (147872,)
Test data shape  (63375, 34579) (63375,)
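
One caveat about the order of operations: the vectorizer above was fitted on the whole corpus before the split, so its vocabulary and IDF weights also reflect the test reviews. A stricter setup fits on the training text only and merely transforms the test text. The toy sketch below shows the idea with plain term counts; the documents are made-up examples.

```python
from collections import Counter

train_docs = ["side effect mild", "no side effect"]   # toy training reviews
test_docs = ["sever headach side effect"]             # toy held-out review

# "Fit": build the vocabulary from the training documents only.
vocabulary = sorted({term for doc in train_docs for term in doc.split()})
index = {term: i for i, term in enumerate(vocabulary)}


def transform(doc):
    """Count vector over the training vocabulary; unseen terms are dropped."""
    counts = Counter(term for term in doc.split() if term in index)
    return [counts.get(term, 0) for term in vocabulary]


print(vocabulary)
print(transform(test_docs[0]))
```

With scikit-learn this corresponds to calling `fit_transform` on the training text and `transform` on the test text.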

MNB Model

In [89]:
clf = MultinomialNB().fit(X_train, Y_train) #fit the training data

pred = clf.predict(X_test) #predict the sentiment for test data

print("Accuracy: %s" % str(clf.score(X_test, Y_test))) #check accuracy
print("Confusion Matrix") 
print(confusion_matrix(Y_test,pred)) #print confusion matrix
Accuracy: 0.7573491124260355
Confusion Matrix
[[ 4259 14794]
 [  584 43738]]

Random Forest Classifier

In [90]:
#fit the model and predict the output

clf = RandomForestClassifier().fit(X_train, Y_train)

pred = clf.predict(X_test)

print("Accuracy: %s" % str(clf.score(X_test, Y_test)))
print("Confusion Matrix")
print(confusion_matrix(Y_test, pred))
Accuracy: 0.8999605522682446
Confusion Matrix
[[13137  5916]
 [  424 43898]]
In [91]:
from sklearn.metrics import f1_score, recall_score, precision_score
print('f1 score is',f1_score(Y_test,pred))
print('recall score is',recall_score(Y_test,pred))
print('precision score is',precision_score(Y_test,pred))
f1 score is 0.9326506331265403
recall score is 0.990433644691124
precision score is 0.8812382061267917
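
The scores above can be reproduced directly from the Random Forest confusion matrix, which makes it clearer what each one measures. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and also computes the recall of class 0, which the printed scores do not show.

```python
# Random Forest confusion matrix from the output above,
# rows = true class (0, 1), columns = predicted class (0, 1).
tn, fp = 13137, 5916
fn, tp = 424, 43898

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of predicted positives, how many are right
recall = tp / (tp + fn)             # of true positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
recall_class_0 = tn / (tn + fp)     # recall of the negative class

print(accuracy, precision, recall, f1, recall_class_0)
```

The class-0 recall comes out at about 0.69, well below the class-1 recall of 0.99, which is the imbalance showing up in the model's errors.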

Results with using downsampled dataset

In [92]:
# Creates TF-IDF vectorizer and transforms the corpus
vectorizer_downsampled = TfidfVectorizer()
reviews_corpus_downsampled = vectorizer_downsampled.fit_transform(data_downsampled.review_clean)
reviews_corpus_downsampled.shape
Out[92]:
(126352, 27626)
In [93]:
sentiment_data_downsampled=data_downsampled['rating_label']
sentiment_data_downsampled.shape
Out[93]:
(126352,)
In [94]:
X_train_down,X_test_down,Y_train_down,Y_test_down = train_test_split(reviews_corpus_downsampled,sentiment_data_downsampled,test_size=0.30)
print('Train data shape ',X_train_down.shape,Y_train_down.shape)
print('Test data shape ',X_test_down.shape,Y_test_down.shape)
Train data shape  (88446, 27626) (88446,)
Test data shape  (37906, 27626) (37906,)
In [95]:
clf_down = MultinomialNB().fit(X_train_down, Y_train_down) #fit the training data

pred_down = clf_down.predict(X_test_down) #predict the sentiment for test data

print("Accuracy: %s" % str(clf_down.score(X_test_down, Y_test_down))) #check accuracy
print("Confusion Matrix") 
print(confusion_matrix(Y_test_down, pred_down)) #print confusion matrix
Accuracy: 0.7732285126365219
Confusion Matrix
[[14772  4218]
 [ 4378 14538]]
In [96]:
#fit the model and predict the output

clf_rf_down = RandomForestClassifier().fit(X_train_down, Y_train_down)

pred_rf_down = clf_rf_down.predict(X_test_down)

print("Accuracy: %s" % str(clf_rf_down.score(X_test_down, Y_test_down)))
print("Confusion Matrix")
print(confusion_matrix(Y_test_down,pred_rf_down,))
Accuracy: 0.8851105365905134
Confusion Matrix
[[16999  1991]
 [ 2364 16552]]
In [97]:
print('f1 score is',f1_score(Y_test_down,pred_rf_down))
print('recall score is',recall_score(Y_test_down,pred_rf_down))
print('precision score is',precision_score(Y_test_down,pred_rf_down))
f1 score is 0.88373955524707
recall score is 0.8750264326496088
precision score is 0.8926279458555789

Conclusion on ML Models & Dataset

  • Random Forest outperforms MNB on both datasets.
  • MNB on the balanced dataset is slightly more accurate (0.77 vs 0.76) and far more even across classes: on the unbalanced dataset it labels 14794 of 19053 class-0 reviews as class 1.
  • Random Forest on the unbalanced dataset scores higher in all areas except precision, where the balanced dataset is better.

Balancing the dataset does not necessarily improve model performance. One likely reason is that downsampling discards data: after tf-idf, our reviews_corpus shrank from 34579 to 27626 columns.
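
Per-class recall makes the MNB comparison more concrete than accuracy alone. The sketch below computes it, together with balanced accuracy (the mean of the per-class recalls), from the two MNB confusion matrices printed earlier. Note that the two matrices come from different test sets, so this is an indicative comparison only.

```python
def per_class_recall(matrix):
    """matrix[i][j] = number of class-i samples predicted as class j."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def balanced_accuracy(matrix):
    recalls = per_class_recall(matrix)
    return sum(recalls) / len(recalls)


mnb_unbalanced = [[4259, 14794], [584, 43738]]
mnb_balanced = [[14772, 4218], [4378, 14538]]

for name, cm in [("unbalanced", mnb_unbalanced), ("balanced", mnb_balanced)]:
    print(name, [round(r, 3) for r in per_class_recall(cm)],
          round(balanced_accuracy(cm), 3))
```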

Our objective : Predict rating based on the review data.

If we were to choose solely on f1-score, Random Forest trained on the unbalanced dataset works better than the one trained on the balanced set. Had the task been to predict the condition from the review data, we would have aimed for a higher recall (in my opinion). However, we will stick with the balanced data for AutoML as well.

Download the dataset to feed into AutoML

In [98]:
# use a separate name so the original `data` DataFrame is not overwritten
export_columns={'clean_review':data_downsampled['review_clean'],'target':data_downsampled['rating_label']}
data_clean=pd.DataFrame(data=export_columns)
In [99]:
data_clean.head()
Out[99]:
clean_review target
15590 implanon sinc april entir first year absolut period great month ago got heavi period light bleed ever sinc drive nut overal best method birth control tri everyth side effect differ everyon alway taken work say forget prefer form birth control requir effort hesit give tri love 1
52010 took six month skin get clear first month worst 1
30180 medic effect anxieti taken mg past yrs heart palpit tight chest tri other prior elavil found woke hangov bodi feel ii forget order refil run take day feel effect 1
150187 year heavi smoke quit five day though enjoy smoke much happen year ago sinc crave husband follow exampl result seven day later 1
12568 exhaust histori anxieti led self medic alcohol final find correct medic doctor doctor past write scrip make crazier crazier ad problem alcohol free panic attack problem keep occupi work fun thing sleep like babi night wonder 1
In [100]:
data_clean.shape
Out[100]:
(126352, 2)
In [101]:
data_clean.to_csv('uci_data_review_clean.csv')

Q5. Use Automated ML for your data set. Explain best model results.

  1. Screenshot of Completed Job Run

image.png

  1. Data Check/Guardrails

image.png

  1. Model outputs of AutoML

image.png

A total of 30 models were run by Azure ML with train: 60%, validation: 10%, test: 30%. The highest accuracy score was reached by Logistic Regression using MaxAbsScaler pre-processing.

  1. ROC Curve

image.png

The ROC plot indicates a good fit of our model.

  1. Confusion Matrix

image.png

The confusion matrix depicted above is computed on the test set.

  1. Scores

image.png

Scores are slightly above those of the RandomForest classifier but within a comparable range. It is not clear to me why micro, macro, and weighted scores were reported even though our target is a binary classification and not a multiclass one. I could not find any option for binary metrics in ML Studio; even in the official documentation for a binary classification task, https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-first-experiment-automated-ml, auc_weighted was chosen.

Based on the definitions, and considering that our dataset is balanced, I think it is better to use the macro scores, as they simply average over the classes.
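
To make the three averaging schemes concrete, the sketch below computes macro, weighted, and micro F1 by hand. It uses the Random Forest confusion matrix on the balanced test set from Q4 as a stand-in, since the AutoML matrix is only available as a screenshot.

```python
# Random Forest confusion matrix on the balanced test set (from Q4),
# rows = true class, columns = predicted class.
cm = [[16999, 1991], [2364, 16552]]
n_classes = len(cm)
total = sum(sum(row) for row in cm)


def f1_for_class(k):
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(n_classes)) - tp
    fn = sum(cm[k]) - tp
    return 2 * tp / (2 * tp + fp + fn)


per_class_f1 = [f1_for_class(k) for k in range(n_classes)]
support = [sum(row) for row in cm]

macro_f1 = sum(per_class_f1) / n_classes
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / total

# Micro-averaging pools TP, FP and FN over all classes before computing F1.
tp_all = sum(cm[k][k] for k in range(n_classes))
errors = total - tp_all  # each error is one FP for one class and one FN for another
micro_f1 = 2 * tp_all / (2 * tp_all + errors + errors)
accuracy = tp_all / total

print(per_class_f1, macro_f1, weighted_f1, micro_f1, accuracy)
```

On a balanced, single-label problem like this one, macro and weighted F1 are nearly identical, and micro F1 equals accuracy, so the choice between them matters little here.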

  1. HyperParameter used

image.png

Screenshot of the hyperparameters used by AutoML. Again, I am not sure why AutoML used "multinomial"; my understanding is that this setting corresponds to the cross-entropy loss.
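
A possible reading of the "multinomial" setting: for two classes, the softmax probability used by the multinomial formulation is the same function as the sigmoid used by ordinary binary logistic regression, applied to the difference of the two class scores. The sketch below checks this numerically for arbitrary example logits; with regularisation, the fitted coefficients of the two formulations can still differ.

```python
import math


def softmax(logits):
    shifted = [z - max(logits) for z in logits]  # shift for numerical stability
    exps = [math.exp(z) for z in shifted]
    return [e / sum(exps) for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


z0, z1 = -0.4, 1.3                      # arbitrary example logits for classes 0 and 1
p_softmax = softmax([z0, z1])[1]        # "multinomial" probability of class 1
p_sigmoid = sigmoid(z1 - z0)            # binary logistic probability of class 1

print(p_softmax, p_sigmoid)
```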

  1. Data Transformation done by Azure

image.png

So Azure also relied on TF-IDF. However, it did some additional pre-processing, as its sparse matrix has 76468 columns, and it used MaxAbsScaler to scale the sparse matrix.

Let us see how the same model works on our reduced feature set - we do not have 76468 columns!
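
For context on the scaler: max-abs scaling divides each feature column by its largest absolute value, so every value lands in [-1, 1] and zeros stay zero, which is why it suits sparse TF-IDF matrices. The toy sketch below shows the arithmetic on a small dense matrix with made-up values.

```python
# Toy feature matrix: rows are samples, columns are features.
X = [
    [0.0, 2.0, 0.0],
    [4.0, 0.0, 0.0],
    [-8.0, 1.0, 0.0],
]

# Per-column maximum absolute value; an all-zero column is left unscaled.
col_max = [max(abs(row[j]) for row in X) or 1.0 for j in range(len(X[0]))]

X_scaled = [[value / col_max[j] for j, value in enumerate(row)] for row in X]

for row in X_scaled:
    print(row)
```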

In [102]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler

#fit the model and predict the output
max_abs_scaler=MaxAbsScaler()
X_train_maxabs=max_abs_scaler.fit_transform(X_train_down)
X_test_maxabs=max_abs_scaler.transform(X_test_down)
lr=LogisticRegression(penalty="l2",C=6866.488450042998,class_weight="balanced",multi_class="multinomial",solver="saga")
clf_lr = lr.fit(X_train_maxabs,Y_train_down)
pred_lr_down = clf_lr.predict(X_test_maxabs)
print("Accuracy: %s" % str(clf_lr.score(X_test_maxabs, Y_test_down)))
print("Confusion Matrix")
print(confusion_matrix(Y_test_down,pred_lr_down))
Accuracy: 0.8179443887511212
Confusion Matrix
[[15652  3338]
 [ 3563 15353]]
/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_sag.py:354: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  ConvergenceWarning,
In [103]:
from sklearn.metrics import f1_score, recall_score, precision_score
print('f1 score is',f1_score(Y_test_down,pred_lr_down))
print('recall score is',recall_score(Y_test_down,pred_lr_down))
print('precision score is',precision_score(Y_test_down,pred_lr_down))
f1 score is 0.8164969287632621
recall score is 0.8116409388877142
precision score is 0.8214113744582955

Logistic Regression (accuracy 0.82) performs better than the Naive Bayes model from Q4 (0.77). Its scores are lower than Random Forest's (0.89) but in a comparable range. Note that the solver reached max_iter without converging, so these scores may understate what this configuration can achieve. We cannot compare this directly with Azure ML, as Azure transformed the data into 76468 features whereas our corpus has just 27626 columns, which is less than 50% of Azure's sparse matrix.

Finally we have two models in comparison :

  1. Our balanced dataset with a sparse matrix of 27626 columns, giving the best results with RandomForest at an accuracy of 0.885.

  2. The balanced dataset fed to Azure AutoML, which generated a sparse matrix of 76468 columns and gave the best results with Logistic Regression at an accuracy of 0.89.

  3. Both provide comparable results, though the AutoML Logistic Regression has higher scores (f1, recall, precision) than RandomForest.

What more can be done ?

  1. Enable deep learning and word embeddings and check whether the results are any better. Neural networks often work better on large-scale data.